Tracing a Loose Wordhood for Chinese Input Method Engine

نویسندگان

  • Xihu Zhang
  • Chu Wei
  • Hai Zhao
چکیده

Chinese input methods are used to convert pinyin sequence or other Latin encoding systems into Chinese character sentences. For more effective pinyin-to-character conversion, typical Input Method Engines (IMEs) rely on a predefined vocabulary that demands manually maintenance on schedule. For the purpose of removing the inconvenient vocabulary setting, this work focuses on automatic wordhood acquisition by fully considering that Chinese inputting is a free human-computer interaction procedure. Instead of strictly defining words, a loose word likelihood is introduced for measuring how likely a character sequence can be a user-recognized word with respect to using IME. Then an online algorithm is proposed to adjust the word likelihood or generate new words by comparing user true choice for inputting and the algorithm prediction. The experimental results show that the proposed solution can agilely adapt to diverse typings and demonstrate performance approaching highly-optimized IME with fixed vocabulary.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Chinese Pinyin Input Method for Mobile Phone

Chinese input method is one of the most difficult problems in Chinese Language Processing. And to input Chinese word in mobile phone effectively is an even bigger challenge. In this paper, we propose a new Chinese pinyin input method in mobile phone. This method uses a compact statistical bigram based language model. Also, to meet the special requirements of Chinese pinyin input in mobile phone...

متن کامل

Growing Related Words from Seed via User Behaviors: A Re-Ranking Based Approach

Motivated by Google Sets, we study the problem of growing related words from a single seed word by leveraging user behaviors hiding in user records of Chinese input method. Our proposed method is motivated by the observation that the more frequently two words cooccur in user records, the more related they are. First, we utilize user behaviors to generate candidate words. Then, we utilize search...

متن کامل

KySS 1.0: a Framework for Automatic Evaluation of Chinese Input Method Engines

Chinese Input Method Engine (IME) plays an important role in Chinese language processing. However, it has been subjected to lacking a proper evaluation metric for a long time. The natural metric for IME is user experience, which is a rather vague goal for research purpose. We propose a novel approach of quantifying user experience by using keystroke count and then correspondingly develop a fram...

متن کامل

Why Chinese Web-as-Corpus is Wacky? Or: How Big Data is Killing Chinese Corpus Linguistics

This paper aims to examine and evaluate the current development of using Web-as-Corpus (WaC) paradigm in Chinese corpus linguistics. I will argue that the unstable notion of wordhood in Chinese and the resulting diverse ideas of implementing word segmentation systems have posed great challenges for those who are keen on building web-scaled corpus data. Two lexical measures are proposed to illus...

متن کامل

A Novel Fuzzy Chinese Address Matching Engine Based on Full-text Search Technology

The ability to locate addresses is one of the most important features in an urban geographic information system. Since Chinese geocoding problem cannot be handled by the European geocoding method, some Chinese scholars did special researches on Chinese geocoding. Current researches all focus on address standardizations and models, and pay less attention to the user input and result control. We ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:
  • CoRR

دوره abs/1712.04158  شماره 

صفحات  -

تاریخ انتشار 2017